1  Introduction


During the 2007 and 2008 Technology Insertion initiatives (TI07 & TI08), the Department of Defense
(DoD) High Performance Computing Modernization Program (HPCMP) acquired and deployed several
High Performance Computing (HPC) systems in the Cray XT series. Two of these systems are the
Cray XT4 located at the U.S. Army Engineer Research and Development Center (ERDC) DoD
Supercomputing Resource Center (DSRC) and the Cray XT5 located at the U.S. Army Research
Laboratory (ARL) DSRC. Each DSRC sponsored a pioneer user access initiative to conduct
performance studies of the two systems before they entered production use. During this pioneer
user period, the scalable performance of the CTH shock physics code was evaluated on each system.

CTH is an explicit, Eulerian, finite volume code for the numerical simulation of the high-rate
response of materials to impulsive loads [1]. CTH is widely used across the defense research and
development complex in the development of explosives, blast and fragmenting warheads, kinetic
energy penetrators, vehicle armor systems, protective structures, etc. CTH employs a single
program multiple data (SPMD) programming model to achieve scalable performance on distributed
memory HPC systems. CTH has been found to achieve linear scalability in performance on many
systems deployed by the HPCMP since the late 1990s [2, 3, 4, 5, 6].


2  HPC  Systems 


The  Cray  XT4  system  installed  at  the  ERDC  DSRC  is  configured  with  2152  compute  nodes,  each 
of  which  has  a  2.1-GHz  Advanced  Micro  Devices  (AMD)  quad-core  processor,  resulting  in  a  total 
of  8608  processor  cores.  Each  compute  node  has  8  GB  of  memory,  resulting  in  average  memory 
availability  of  2  GB  per  processor  core,  and  a  total  system  memory  of  approximately  16.4  TB  [7]. 
When  the  XT4  was  initially  installed,  it  was  configured  with  dual-core  AMD  processors.  The 
compute  nodes  were  later  upgraded  to  quad-core  processors.  The  CTH  scalability  study  was 
performed on the XT4 for both dual-core and quad-core configurations.

The Cray XT5 system installed at the ARL DSRC is configured with 1300 compute nodes, each of
which has two 2.3-GHz AMD quad-core processors, resulting in a total of 10,400 processor cores.
Each compute node has 32 GB of memory, resulting in an average memory availability of 4 GB per
processor core, and a total system memory of 41.6 TB. Both the XT4 and XT5 employ a proprietary
high-speed network for communication between compute nodes [8].


3  Scalability  Study  Parameters 


The scalability of CTH on the XT systems was determined through a series of simulations that
employed both fixed and adaptive meshes. The benchmark simulation employed for the study
involved the yawed, oblique impact of a depleted uranium long rod penetrator against a steel
target plate. This benchmark problem was selected because of its relevance to DoD problems in
shock physics. A thorough description of the benchmark simulation has been previously
documented [2, 3, 4, 5, 6].

The fixed-mesh scalability simulations were conducted with a nearly constant workload per task.
This was done to keep the computation-to-communication ratio as close to constant as possible for
simulations involving different numbers of processor cores (one CTH task assigned to each
processor core). Maintaining a nearly constant computation-to-communication ratio and minimizing
disk access for intermediate plot and restart files during the time integration permitted the
computational performance to be isolated and measured as a function of the number of processor
cores used.

As the number of CTH tasks was increased, the fixed mesh was incrementally refined by uniformly
decreasing the characteristic cell size in each coordinate direction by the nearest integer
factor of $2^{-1/3}$. This approach approximately doubles the total number of Eulerian cells with
each successive mesh refinement. The characteristics of the meshes used in the scalability study
are summarized in Table 1. In this table, the columns NI, NJ, and NK refer to the number of
Eulerian cells in the x, y, and z directions, respectively. The mesh sizes listed in the table
produce computational sub-domains containing approximately 387,000 Eulerian cells each. For the
8192-task simulation, this results in a computational domain containing approximately 3 billion
Eulerian cells.
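
The fixed-mesh scaling rule can be illustrated with a short sketch. It assumes power-of-two task counts and takes its baseline values from the one-task row of Table 1; the function name is hypothetical and is not part of CTH. The real meshes use integer cell counts in each direction, so the tabulated totals only approximately follow this idealized rule.

```python
# Illustrative sketch only: idealized fixed-mesh scaling for power-of-two task counts.
# BASE_CELLS and BASE_SIZE_MM are the one-task values from Table 1; the function
# name is hypothetical and not part of CTH.
BASE_CELLS = 387_000   # cells in the one-task mesh (215 x 30 x 60)
BASE_SIZE_MM = 1.0     # cell size of the one-task mesh, in mm


def fixed_mesh_estimate(tasks: int) -> tuple[float, int]:
    """Return (approximate cell size in mm, approximate total cell count)."""
    doublings = tasks.bit_length() - 1           # log2(tasks) for powers of two
    cell_size = BASE_SIZE_MM * 2.0 ** (-doublings / 3.0)
    total_cells = BASE_CELLS * 2 ** doublings    # cell count doubles per refinement
    return cell_size, total_cells


for tasks in (1, 8, 64, 512, 4096):
    size, cells = fixed_mesh_estimate(tasks)
    print(f"{tasks:5d} tasks: ~{size:.3f} mm, ~{cells:,} cells")
```

Every third doubling of the task count halves the cell size exactly, which is why the 8-, 64-, 512-, and 4096-task meshes in Table 1 have whole-number multiples of the baseline dimensions.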

The adaptive mesh refinement (AMR) capability in CTH allows the definition of the mesh to
change during the simulation based on the evolving characteristics of the simulation [9]. The
adaptation of the mesh is based on user-defined indicators, such as the value, gradient, or
difference of a variable in the solution (pressure, density, velocity, stress, etc.). This
technique results in simulations in which the most highly resolved mesh "follows" the activity
of interest to the analyst while using less highly resolved mesh in the remainder of the
computational domain. This allows the analyst to configure highly resolved simulations that have
fewer total computational cells than a comparable fixed-mesh simulation having the same minimum
cell size.

Table 1. Scalability study parameters.

          Fixed Mesh                                 Adaptive Mesh
                                                        Maximum    Minimum
  Number                    Total Number  Cell Size  Refinement  Cell Size
of Tasks    NI   NJ    NK       of Cells       (mm)       Level       (mm)
--------  ----  ---  ----  -------------  ---------  ----------  ---------
       1   215   30    60        387,000      1.000           5      0.750
       2   271   38    75        772,350      0.793           5      0.750
       4   341   48    95      1,554,960      0.630           5      0.750
       8   430   60   120      3,096,000      0.500           6      0.375
      16   541   76   151      6,208,516      0.397           6      0.375
      32   683   95   191     12,393,035      0.315           6      0.375
      64   860  120   240     24,768,000      0.250           7      0.188
     128  1083  151   302     49,386,966      0.199           7      0.188
     256  1366  190   382     99,144,280      0.157           7      0.188
     512  1720  240   480    198,144,000      0.125           8      0.094
    1024  2166  302   604    395,095,728      0.099           8      0.094
    2048  2732  380   764    793,154,240      0.079           8      0.094
    4096  3440  480   960  1,585,152,000      0.063           9      0.047
    8192  4334  605  1210  3,172,704,700      0.050           9      0.047

The  AMR  implementation  in  CTH  is  a  block-based  scheme  in  which  each  block  consists  of  an 
orthogonal  mesh  with  a  fixed  number  of  cells  in  the  x,  y,  and  z  directions.  The  blocks  are  connected 
in  a  hierarchical  manner  with  adjacent  blocks  having  either  exactly  the  same  cell  size  or  exactly  a 
2:1  ratio  in  cell  size.  Refinement  or  un-refinement  of  the  mesh  is  accomplished  through  a  series  of 
transitions  of  adjacent  blocks  with  a  difference  in  mesh  density  of  2:1.  All  mesh  blocks  at  a  given 
mesh  density  are  at  the  same  refinement  level.  The  finest  mesh  resolution  that  can  exist  in  the 
computational  domain  is  controlled  by  defining  the  maximum  refinement  level  of  the  mesh. 

The  AMR  CTH  benchmark  used  in  the  scalability  study  was  configured  to  be  physically  identical 
to  the  fixed-mesh  simulation.  The  only  difference  between  the  fixed-mesh  simulation  and  the 
AMR  simulation  was  the  definition  of  the  mesh.  The  size  of  the  mesh  in  the  AMR  simulation  was 
scaled  with  the  number  of  CTH  tasks  in  a  manner  similar  to  the  fixed-mesh  study.  However,  it  is 
not  possible  to  precisely  scale  the  total  number  of  cells  in  the  AMR  simulation  since  the  refinement 
and  un-refinement  indicators  are  based  on  the  physics,  not  the  topology  of  the  computational 
domain.  Thus,  to  scale  the  size  of  the  simulation  in  a  controlled  manner,  the  maximum  refinement 
level was increased by one for every factor of eight increase in the number of tasks. The 2:1 ratio
of cell size between refinement levels results in an increase of approximately a factor of eight
in the total number of cells in the 3-D simulation. The variation of the maximum refinement level
and the resulting
minimum  cell  size  with  the  number  of  tasks  used  is  summarized  in  Table  1. 
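
The AMR scaling rule described above can be sketched as follows. This is an illustration under stated assumptions: power-of-two task counts, a baseline maximum refinement level of 5, and a baseline minimum cell size of 0.750 mm, all taken from the one-task row of Table 1. The function name is hypothetical and not part of CTH.

```python
# Illustrative sketch only: maximum refinement level and minimum cell size as a
# function of task count. Baseline values come from the one-task row of Table 1;
# the function name is hypothetical and not part of CTH.
BASE_LEVEL = 5             # maximum refinement level for the one-task case
BASE_MIN_SIZE_MM = 0.750   # minimum cell size for the one-task case, in mm


def amr_settings(tasks: int) -> tuple[int, float]:
    """Return (maximum refinement level, minimum cell size in mm)."""
    doublings = tasks.bit_length() - 1        # log2(tasks) for powers of two
    extra_levels = doublings // 3             # one level per factor of eight in tasks
    level = BASE_LEVEL + extra_levels
    min_size = BASE_MIN_SIZE_MM / 2 ** extra_levels   # 2:1 cell-size ratio per level
    return level, min_size


for tasks in (1, 8, 64, 512, 4096, 8192):
    level, min_size = amr_settings(tasks)
    print(f"{tasks:5d} tasks: max level {level}, min cell size {min_size:.6f} mm")
```

The unrounded values 0.1875, 0.09375, and 0.046875 mm correspond to the entries 0.188, 0.094, and 0.047 listed in Table 1.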

The scalable performance of CTH is measured by the "grind time," which is the average time
required for the code to update all field variables for one computational cell in a given time
increment (cycle). In a case of ideal scalability, the grind time will decrease by a factor of two
for every doubling of the number of processor cores used if the ratio of computation to
communication is held constant.


4  Study  Results:  Cray  XT4 


The  results  of  the  CTH  scalability  study  on  the  XT4  are  provided  in  Figure  1.  In  this  figure,  the 
fixed  mesh  study  results  are  plotted  in  the  chart  on  the  left  of  the  figure  and  the  adaptive  mesh 
results  are  plotted  in  the  chart  on  the  right.  Both  dual-core  and  quad-core  results  are  provided  in 
each  chart.  The  results  demonstrate  that  CTH  achieves  linear  scalability  on  the  XT4  up  to  2048 
CTH  tasks,  the  maximum  number  considered  in  the  study  on  this  system. 


Figure 1. Scalability on XT4 with dual- and quad-core processors (left panel: fixed mesh; right
panel: adaptive mesh).


The  CTH  simulations  run  on  the  XT4  were  executed  in  such  a  way  as  to  completely  fill  the  assigned 
compute  nodes  whenever  possible.  For  example,  a  job  using  eight  CTH  tasks  under  the  quad-core 
configuration  would  use  two  nodes. 

The linear scalability of CTH can be described by $g_n = g_1 / n^m$, where $g_n$ is the expected
grind time on $n$ processor cores, $g_1$ is the measured grind time on one processor core, and $m$
is the parallel efficiency, the slope of the scalability line. A slope of 1 corresponds to perfect
scalability. A regression analysis was performed on the measured performance data to determine the
parallel efficiency of CTH on the systems under consideration. In both fixed and adaptive mesh
results, the parallel efficiency on the XT4 was slightly better for the dual-core processors than
the quad-core processors. However, this performance penalty is far outweighed by the doubling in
total system capability gained by replacing the dual-core processors with quad-core processors.
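
One way such a regression could be carried out is an ordinary least-squares fit in log-log space, since the relation between grind time and core count becomes a straight line with slope -m when both are plotted on logarithmic axes. The sketch below is an assumption about the general approach, not the authors' actual analysis; the function name and the grind-time values are hypothetical, generated synthetically with m = 0.95 purely to exercise the fit.

```python
import math


def fit_parallel_efficiency(cores, grind_times):
    """Least-squares fit of log(g_n) = log(g_1) - m * log(n); returns (m, fitted g_1)."""
    xs = [math.log(n) for n in cores]
    ys = [math.log(g) for g in grind_times]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return -slope, math.exp(intercept)


# Hypothetical data, NOT measurements from the study: synthetic grind times
# generated with g_1 = 2.0 (arbitrary units) and m = 0.95.
cores = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
grind_times = [2.0 / n ** 0.95 for n in cores]

m, g1 = fit_parallel_efficiency(cores, grind_times)
print(f"fitted parallel efficiency m = {m:.3f}, fitted g_1 = {g1:.3f}")
```

With measured data the fitted m would fall slightly below 1 to the extent that communication and memory contention erode ideal scaling.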


5  Study  Results:  Cray  XT5 


When  the  scalability  simulations  were  run  on  the  XT5,  they  were  configured  to  use  every  possible 
power-of-two  combination  of  nodes  and  tasks  per  node  to  achieve  the  desired  total  number  of 
CTH  tasks.  For  example,  for  the  case  involving  4  total  CTH  tasks,  simulations  were  performed  on: 
(1)  one  node  using  four  processor  cores,  (2)  two  nodes  using  two  processor  cores  per  node,  and  (3) 
four  nodes  using  one  processor  core  per  node.  Executing  the  study  in  this  way  makes  it  possible  to 
determine  the  effect  of  the  number  of  CTH  tasks  assigned  per  node  on  the  overall  computational 
performance. 
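
The enumeration of node and tasks-per-node combinations can be sketched as below. The eight cores per node and the 1300-node limit come from the XT5 description in Section 2; the function name is hypothetical, and capping the node count at the system size is an assumption about how the largest cases would be constrained.

```python
def node_layouts(total_tasks, cores_per_node=8, max_nodes=1300):
    """List (nodes, tasks_per_node) pairs with power-of-two tasks per node.

    Assumes cores_per_node is a power of two. Layouts that would need more
    than max_nodes nodes are skipped (an assumption, not stated in the study).
    """
    layouts = []
    tasks_per_node = cores_per_node
    while tasks_per_node >= 1:
        if total_tasks % tasks_per_node == 0:
            nodes = total_tasks // tasks_per_node
            if nodes <= max_nodes:
                layouts.append((nodes, tasks_per_node))
        tasks_per_node //= 2
    return layouts


# The four-task case from the text: one node with four tasks, two nodes with
# two tasks each, and four nodes with one task each.
print(node_layouts(4))
```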

The  results  of  the  fixed  mesh  CTH  scalability  study  on  the  XT5  are  provided  in  Figure  2.  The 
chart  on  the  left  of  this  figure  is  a  plot  of  the  grind  time  as  a  function  of  CTH  tasks,  similar  to 
that  provided  in  Figure  1.  In  this  chart,  the  data  are  plotted  separately  for  different  numbers  of 
CTH  tasks  assigned  to  each  node.  The  plot  shows  that  for  a  given  number  of  tasks  there  is  a  slight 
increase  in  the  grind  time  as  the  number  of  tasks  per  node  is  increased.  This  is  likely  a  result  of 
increased  memory  contention  on  the  node.  This  slight  penalty  associated  with  increased  tasks  per 
node is counterbalanced by the fact that the parallel efficiency, m, increases slightly as the
number of tasks per node is increased.

The  chart  on  the  right  of  Figure  2  attempts  to  quantify  the  effect  of  the  number  of  tasks  per  node 
on  the  performance.  In  this  plot,  for  each  total  number  of  CTH  tasks  assigned,  from  eight  to  1024, 
the  grind  time  is  plotted  relative  to  the  one  task-per-node  case.  This  plot  shows  that  as  the  number 
of  tasks  per  node  is  increased  to  the  maximum  of  eight  tasks  per  node,  the  grind  time  increases  by 
approximately  20%.  However,  the  cases  involving  the  largest  processor  counts  (512  and  1024)  had 
the lowest penalty on the grind time as the number of tasks per node was increased. Thus, for the
typical operating environment in which all processor cores in the compute nodes are assigned a
task, the XT5 system operates most efficiently when running large jobs.

Figure 2. Fixed mesh scalability and relative performance on XT5.

The results of the adaptive mesh CTH scalability study on the XT5 are provided in Figure 3. This
figure plots the results in the same format and at the same scale as the fixed mesh results in
Figure 2. The findings of the adaptive mesh scalability study on the XT5 are similar to the fixed
mesh findings: (1) the parallel efficiency generally increases as the number of tasks per node is
increased, (2) the grind time ratio increases as the number of tasks per node is increased,
indicating increased memory contention on the node, and (3) the grind time ratios are smallest for
the largest number of tasks assigned.


6  Summary 


Linear  scalability  of  CTH  on  the  Cray  XT4  and  XT5  systems  has  been  demonstrated  for  simulations 
using  up  to  8192  processor  cores.  The  linear  scalability  was  demonstrated  for  simulations  using 
both  fixed  and  adaptive  meshes.  At  the  time  of  installation  of  these  two  systems,  they  were  among 
the  largest  deployed  under  the  auspices  of  the  HPCMP.  This  fact,  combined  with  the  finding  that 
these systems operate most efficiently for simulations that use large numbers of processor cores,
indicates that the productivity of these systems would be maximized by targeting large jobs to
these resources.

Figure 3. Adaptive mesh scalability and relative performance on XT5.